103 research outputs found
Predicting Dynamic Memory Requirements for Scientific Workflow Tasks
With the increasing amount of data available to scientists in disciplines as
diverse as bioinformatics, physics, and remote sensing, scientific workflow
systems are becoming increasingly important for composing and executing
scalable data analysis pipelines. When writing such workflows, users need to
specify the resources to be reserved for tasks so that sufficient resources are
allocated on the target cluster infrastructure. Crucially, underestimating a
task's memory requirements can result in task failures. Therefore, users often
resort to overprovisioning, resulting in significant resource wastage and
decreased throughput.
In this paper, we propose a novel online method that uses monitoring time
series data to predict task memory usage in order to reduce the memory wastage
of scientific workflow tasks. Our method predicts a task's runtime, divides it
into k equally-sized segments, and learns the peak memory value for each
segment depending on the total file input size. We evaluate the prototype
implementation of our method using workflows from the publicly available
nf-core repository, showing an average memory wastage reduction of 29.48%
compared to the best state-of-the-art approac
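The segment-wise idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a plain least-squares line per segment as the learner (the abstract does not name the model), skips the runtime-prediction step by using already-recorded series, and all function names are hypothetical.

```python
from statistics import mean


def segment_peaks(series, k):
    """Split a memory time series into k near-equal segments; return each segment's peak."""
    n = len(series)
    peaks = []
    for i in range(k):
        start = i * n // k
        end = max((i + 1) * n // k, start + 1)  # keep every segment non-empty
        peaks.append(max(series[start:end]))
    return peaks


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (assumed learner)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my  # all inputs identical: fall back to the mean peak
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def fit_segment_models(history, k):
    """history: list of (total_input_size, memory_series) from earlier task runs."""
    sizes = [size for size, _ in history]
    peaks = [segment_peaks(series, k) for _, series in history]
    return [fit_line(sizes, [p[i] for p in peaks]) for i in range(k)]


def predict_segments(models, input_size):
    """Predicted peak memory per segment for a new task with the given input size."""
    return [slope * input_size + intercept for slope, intercept in models]


# Two earlier runs: (input size, monitored memory usage over time)
history = [(1, [1, 2, 3, 4]), (2, [2, 4, 6, 8])]
models = fit_segment_models(history, k=2)
prediction = predict_segments(models, input_size=3)
```

Allocating memory per segment rather than one peak for the whole run is what allows the early, low-memory phases of a task to reserve less.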
Towards Energy Consumption and Carbon Footprint Testing for AI-driven IoT Services
Energy consumption and carbon emissions are expected to be crucial factors
for Internet of Things (IoT) applications. Both the scale and the
geo-distribution keep increasing, while Artificial Intelligence (AI) further
penetrates the "edge" in order to satisfy the need for highly-responsive and
intelligent services. To date, several edge/fog emulators are catering for IoT
testing by supporting the deployment and execution of AI-driven IoT services in
consolidated test environments. These tools enable the configuration of
infrastructures so that they closely resemble edge devices and IoT networks.
However, energy consumption and carbon emissions estimations during the testing
of AI services are still missing from the current state of IoT testing suites.
This study highlights important questions to which developers of AI-driven IoT
services need answers, along with a set of observations and
challenges, aiming to help researchers designing IoT testing and benchmarking
suites to cater to user needs.
Comment: Presented at the 2nd International Workshop on Testing Distributed
Internet of Things Systems (TDIS 2022
Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?
Distributed dataflow systems such as Apache Spark or Apache Flink enable
parallel, in-memory data processing on large clusters of commodity hardware.
Consequently, the appropriate amount of memory to allocate to the cluster is a
crucial consideration.
In this paper, we analyze the challenge of efficient resource allocation for
distributed data processing, focusing on memory. We emphasize that in-memory
processing with in-memory data processing frameworks can undermine resource
efficiency. Based on the findings of our trace data analysis, we compile
requirements towards an automated solution for efficient cluster resource
allocation.
Comment: 4 pages, 3 Figures; ACM SSDBM 202
Ruya: Memory-Aware Iterative Optimization of Cluster Configurations for Big Data Processing
Selecting appropriate computational resources for data processing jobs on
large clusters is difficult, even for expert users like data engineers.
Inadequate choices can result in vastly increased costs, without significantly
improving performance. One crucial aspect of selecting an efficient resource
configuration is avoiding memory bottlenecks. By knowing the required memory of
a job in advance, the search space for an optimal resource configuration can be
greatly reduced.
Therefore, we present Ruya, a method for memory-aware optimization of data
processing cluster configurations based on iteratively exploring a
narrowed-down search space. First, we perform job profiling runs with small
samples of the dataset on just a single machine to model the job's memory usage
patterns. Second, we prioritize cluster configurations with a suitable amount
of total memory and within this reduced search space, we iteratively search for
the best cluster configuration with Bayesian optimization. This search process
stops once it converges on a configuration that is believed to be optimal for
the given job. In our evaluation on a dataset with 1031 Spark and Hadoop jobs,
we see a reduction of search iterations to find an optimal configuration by
around half, compared to the baseline.
Comment: 9 pages, 5 Figures, 3 Tables; IEEE BigData 2022. arXiv admin note:
substantial text overlap with arXiv:2206.1385
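The first two steps described in this abstract (profiling on small samples, then prioritizing configurations by total memory) can be sketched as follows. This is an illustration under stated assumptions, not Ruya itself: memory use is assumed to grow linearly with input size, the cluster configurations and all names are hypothetical, and the Bayesian optimization step over the narrowed search space is omitted.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def extrapolate_memory(sample_sizes_gb, observed_memory_gb, full_size_gb):
    """Estimate memory need of the full job from single-machine profiling runs.

    Assumes a linear memory usage pattern (an assumption of this sketch).
    """
    slope, intercept = fit_line(sample_sizes_gb, observed_memory_gb)
    return slope * full_size_gb + intercept


def total_memory(config):
    _name, nodes, memory_per_node_gb = config
    return nodes * memory_per_node_gb


def prioritize(configs, required_gb):
    """Order configurations: those with enough total memory first (smallest first),
    followed by the rest, ordered from closest to furthest below the requirement."""
    enough = sorted((c for c in configs if total_memory(c) >= required_gb), key=total_memory)
    short = sorted((c for c in configs if total_memory(c) < required_gb),
                   key=total_memory, reverse=True)
    return enough + short


# Hypothetical profiling results on 1, 2, and 4 GB samples of a 100 GB dataset
required = extrapolate_memory([1, 2, 4], [3, 5, 9], full_size_gb=100)

# Hypothetical configurations: (name, number of nodes, memory per node in GB)
configs = [("a", 4, 32), ("b", 8, 32), ("c", 16, 32)]
ordered = prioritize(configs, required)
```

An optimizer would then only need to evaluate the front of `ordered`, which is the sense in which knowing the memory requirement narrows the search space.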
Detecting and Mitigating Network Packet Overloads on Real-Time Devices in IoT Systems
Manufacturing, automotive, and aerospace environments use embedded systems
for control and automation and need to fulfill strict real-time guarantees. To
facilitate more efficient business processes and remote control, such devices
are being connected to IP networks. Due to the difficulty in predicting network
packets and the interrelated workloads of interrupt handlers and drivers,
devices controlling time-critical processes are at risk of missing
process deadlines when under high network loads. Additionally, devices at the
edge of large networks and the internet are subject to a high risk of load
spikes and network packet overloads.
In this paper, we investigate strategies to detect network packet overloads
in real-time and present four approaches to adaptively mitigate local deadline
misses. In addition to two strategies mitigating network bursts with and
without hysteresis, we present and discuss two novel mitigation algorithms,
called Budget and Queue Mitigation. In an experimental evaluation, all
algorithms showed mitigating effects, with the Queue Mitigation strategy
enabling most packet processing while preventing lateness of critical tasks.
Comment: EdgeSys '2
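The difference between burst mitigation with and without hysteresis can be illustrated with a small threshold detector. This is a generic sketch, not one of the paper's algorithms: the thresholds, the per-tick packet counts, and the class name are all hypothetical.

```python
class OverloadDetector:
    """Switches mitigation on above `high` and off again only below `low`.

    Setting low == high gives the variant without hysteresis.
    """

    def __init__(self, high, low):
        if low > high:
            raise ValueError("low threshold must not exceed high threshold")
        self.high = high
        self.low = low
        self.mitigating = False

    def update(self, packets_per_tick):
        """Feed one measurement; return whether mitigation is active afterwards."""
        if not self.mitigating and packets_per_tick > self.high:
            self.mitigating = True
        elif self.mitigating and packets_per_tick < self.low:
            self.mitigating = False
        return self.mitigating


# Hypothetical packet counts per tick
rates = [10, 120, 90, 70, 40, 60]

with_hysteresis = OverloadDetector(high=100, low=50)
states_hyst = [with_hysteresis.update(r) for r in rates]

without_hysteresis = OverloadDetector(high=100, low=100)
states_plain = [without_hysteresis.update(r) for r in rates]
```

With hysteresis, mitigation stays active while the load hovers between the two thresholds, which avoids rapid toggling when the packet rate fluctuates around a single limit.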
How Workflow Engines Should Talk to Resource Managers: A Proposal for a Common Workflow Scheduling Interface
Scientific workflow management systems (SWMSs) and resource managers together
ensure that tasks are scheduled on provisioned resources so that all
dependencies are obeyed, and some optimization goal, such as makespan
minimization, is fulfilled. In practice, however, there is no clear separation
of scheduling responsibilities between an SWMS and a resource manager because
there exists no agreed-upon separation of concerns between their different
components. This has two consequences. First, the lack of a standardized API to
exchange scheduling information between SWMSs and resource managers hinders
portability. It incurs costly adaptations when a component should be replaced
by another one (e.g., an SWMS with another SWMS on the same resource manager).
Second, due to overlapping functionalities, current installations often
actually have two schedulers, both making partial scheduling decisions under
incomplete information, leading to suboptimal workflow scheduling.
In this paper, we propose a simple REST interface between SWMSs and resource
managers, which allows any SWMS to pass dynamic workflow information to a
resource manager, enabling maximally informed scheduling decisions. We provide
an exemplary implementation of this API for Nextflow as an SWMS and Kubernetes
as a resource manager. Our experiments with nine real-world workflows show that
this strategy reduces makespan by up to 25.1% and 10.8% on average compared to
the standard Nextflow/Kubernetes configuration. Furthermore, a more widespread
implementation of this API would enable leaner code bases, a simpler exchange
of components of workflow systems, and a unified place to implement new
scheduling algorithms.
Comment: Paper accepted in: 2023 23rd IEEE International Symposium on Cluster,
Cloud and Internet Computing (CCGrid
Lotaru: Locally Predicting Workflow Task Runtimes for Resource Management on Heterogeneous Infrastructures
Many resource management techniques for task scheduling, energy and carbon
efficiency, and cost optimization in workflows rely on a-priori task runtime
knowledge. Building runtime prediction models on historical data is often not
feasible in practice as workflows, their input data, and the cluster
infrastructure change. Online methods, on the other hand, which estimate task
runtimes on specific machines while the workflow is running, have to cope with
a lack of measurements during start-up. Frequently, scientific workflows are
executed on heterogeneous infrastructures consisting of machines with different
CPU, I/O, and memory configurations, further complicating predicting runtimes
due to different task runtimes on different machine types.
This paper presents Lotaru, a method for locally predicting the runtimes of
scientific workflow tasks before they are executed on heterogeneous compute
clusters. Crucially, our approach does not rely on historical data and copes
with a lack of training data during the start-up. To this end, we use
microbenchmarks, reduce the input data to quickly profile the workflow locally,
and predict a task's runtime with a Bayesian linear regression based on the
gathered data points from the local workflow execution and the microbenchmarks.
Due to its Bayesian approach, Lotaru provides uncertainty estimates that can be
used for advanced scheduling methods on distributed cluster infrastructures.
In our evaluation with five real-world scientific workflows, our method
outperforms two state-of-the-art runtime prediction baselines and decreases the
absolute prediction error by more than 12.5%. In a second set of experiments,
the prediction performance of our method, using the predicted runtimes for
state-of-the-art scheduling, carbon reduction, and cost prediction, enables
results close to those achieved with perfect prior knowledge of runtimes
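How a Bayesian linear regression yields both a runtime prediction and an uncertainty estimate can be sketched in a few lines. This is a textbook conjugate-Gaussian model, not Lotaru's implementation: it assumes a single feature (input size), a known noise precision `beta`, and a zero-mean isotropic prior with precision `alpha`; the data points and all names are hypothetical, and the microbenchmark-based adjustment between machines is omitted.

```python
def fit_bayesian_line(xs, ys, alpha=1e-6, beta=1.0):
    """Posterior over (intercept, slope) for y = w0 + w1 * x + noise.

    Prior: w ~ N(0, (1/alpha) I); noise precision beta is assumed known.
    Returns the posterior mean (m0, m1) and covariance entries (s11, s12, s22).
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Posterior precision A = alpha * I + beta * Phi^T Phi, with Phi rows [1, x]
    a11 = alpha + beta * n
    a12 = beta * sx
    a22 = alpha + beta * sxx
    det = a11 * a22 - a12 * a12
    # Posterior covariance S = A^{-1} (closed-form 2x2 inverse)
    s11, s12, s22 = a22 / det, -a12 / det, a11 / det
    # Posterior mean m = beta * S * Phi^T y
    m0 = beta * (s11 * sy + s12 * sxy)
    m1 = beta * (s12 * sy + s22 * sxy)
    return (m0, m1), (s11, s12, s22)


def predict_runtime(model, x, beta=1.0):
    """Predictive mean and variance at input size x."""
    (m0, m1), (s11, s12, s22) = model
    mean = m0 + m1 * x
    variance = 1.0 / beta + s11 + 2.0 * s12 * x + s22 * x * x
    return mean, variance


# Hypothetical local profiling runs on reduced inputs: (input size, runtime)
sizes = [1, 2, 3, 4]
runtimes = [10, 20, 30, 40]
model = fit_bayesian_line(sizes, runtimes)

mean_near, var_near = predict_runtime(model, 2.5)  # inside the profiled range
mean_far, var_far = predict_runtime(model, 10)     # extrapolation to a larger input
```

The predictive variance grows as the query moves away from the profiled input sizes, which is the kind of uncertainty signal a scheduler could take into account.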
Magpie: Automatically Tuning Static Parameters for Distributed File Systems using Deep Reinforcement Learning
Distributed file systems are widely used nowadays, yet using their default
configurations is often not optimal. At the same time, tuning configuration
parameters is typically challenging and time-consuming. It demands expertise,
and tuning operations can also be expensive. This is especially the case for
static parameters, where changes take effect only after a restart of the system
or workloads. We propose a novel approach, Magpie, which utilizes deep
reinforcement learning to tune static parameters by strategically exploring and
exploiting configuration parameter spaces. To boost the tuning of the static
parameters, our method employs both server and client metrics of distributed
file systems to understand the relationship between static parameters and
performance. Our empirical evaluation results show that Magpie can noticeably
improve the performance of the distributed file system Lustre, where our
approach on average achieves 91.8% throughput gains against default
configuration after tuning towards single performance indicator optimization,
while it reaches 39.7% more throughput gains against the baseline.
Comment: Accepted at The IEEE International Conference on Cloud Engineering
(IC2E) conference 202
Towards a Cognitive Compute Continuum: An Architecture for Ad-Hoc Self-Managed Swarms
In this paper we introduce our vision of a Cognitive Computing Continuum to
address the changing IT service provisioning towards a distributed,
opportunistic, self-managed collaboration between heterogeneous devices outside
the traditional data center boundaries. The focal points of this continuum are
cognitive devices, which have to make decisions autonomously using their
on-board computation and storage capacity based on information sensed from
their environment. Such devices are moving and cannot rely on fixed
infrastructure elements, but instead realise on-the-fly networking and thus
frequently join and leave temporal swarms. All this creates novel demands for
the underlying architecture and resource management, which must bridge the gap
from edge to cloud environments, while keeping the QoS parameters within
required boundaries. The paper presents an initial architecture and a resource
management framework for the implementation of this type of IT service
provisioning.
Comment: 8 pages, CCGrid 2021 Cloud2Things Worksho